
Frontiers in Digital Health

Frontiers Media SA

Preprints posted in the last 90 days, ranked by how well they match Frontiers in Digital Health's content profile, based on 20 papers previously published here. The average preprint has a 0.05% match score for this journal, so anything above that is already an above-average fit.

1
Does Recording Hardware Matter for Clinical Speech Recognition? Evaluating ASR Performance Across Consumer Devices

Tran, B. D.; Hu, D.; Kim, S.; Guo, Y.; Mangu, R.; Reynolds, T. L.; Lafata, J. E.; Tai-Seale, M.; Zheng, K.

2026-05-22 health informatics 10.64898/2026.05.19.26353590 medRxiv
Top 0.1%
18.4%

Ambient clinical intelligence (ACI) systems use automatic speech recognition (ASR) to capture patient-provider conversations for downstream clinical documentation. However, many ASR evaluations are conducted under controlled conditions using specialized hardware. We evaluated how recording devices influence transcription performance of contemporary ASR engines applied to clinical dialogue. Thirty-five primary care encounters were re-enacted from transcribed conversations and recorded using five devices simultaneously: smartphone, laptop microphone, portable recorder, clip-on microphone, and a desktop microphone. Six ASR engines were evaluated using word error rate (WER), clinical concept extraction precision and recall, and sentence-level semantic similarity. Median WER ranged from 16.7% to 20.7% across engines. Engine choice produced larger variation in transcription performance than recording device, although device-related differences were statistically significant. Overall, contemporary ASR engines demonstrated relative robustness to consumer-grade recording hardware, suggesting that model selection may have greater impact on transcription performance than recording device configuration in real-world ACI deployments.
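The word error rate (WER) reported above is conventionally the word-level edit distance between a reference transcript and the ASR output, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the authors' evaluation code; the lowercasing and whitespace tokenization are simplifying assumptions, and the example sentences are invented.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    # Assumption: lowercase + whitespace split stands in for real text normalization.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Invented example: one substitution ("takes" -> "take") and one deletion ("daily").
wer = word_error_rate("the patient takes aspirin daily", "the patient take aspirin")
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.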

2
CUOREMA: Immersive Bio & Behavioral Feedback and Digital Interventions for Cardiac Rehabilitation - Exploratory Analysis

Svihrova, R.; Marzorati, D.; Odello, T.; Monachino, G.; Staletti, T.; Tieben, R.; Luigies, R.; Bodewes, N.; Rutten, W.; Barrett, G.; Bhogal, A.; Wilkinson, T.; Tzovara, A.; Faraci, F. D.

2026-05-15 rehabilitation medicine and physical therapy 10.64898/2026.05.15.26353188 medRxiv
Top 0.1%
18.3%

Cardiac rehabilitation is critical for secondary prevention, yet long-term adherence remains low. We present CUOREMA, a new personalized mobile health system integrating self-monitoring diaries, wearable data, virtual coaching, and reinforcement learning-enhanced adaptive interventions to support lifestyle change during and after outpatient cardiac rehabilitation. In a six-month two-center feasibility study (N = 53, Switzerland and France), we evaluated usability, engagement patterns, and preliminary health-related outcomes. Attrition was high: only 19% of participants used the app on more than 100 days, and questionnaire response rates declined from 55% at baseline to 13% at six months. Despite these limitations, exploratory data-driven analysis revealed three distinct engagement clusters (dropout, sporadic, and consistent), which were further supported by matching patterns in app component usage, medication diary adoption, and smartwatch wearing time. Engagement clusters were not associated with demographic factors; instead, psychological themes of patients' personal goals suggested that intrinsic motivation characterized sustained users, whereas extrinsic motivation predominated among early dropouts. User experience was rated positively, and validated questionnaire scores showed no deterioration over time. One center demonstrated a statistically significant improvement in 6-minute walking test performance, though the study was not powered to detect clinical outcomes and selective dropout cannot be ruled out. These findings highlight engagement variability as a central challenge in digital cardiac rehabilitation and suggest that tailoring interventions to individual motivational profiles may improve long-term adherence.

3
Recovering Clinical Detail in AI-Generated Responses for Low Back Pain Through Prompt Design

Basharat, A.; Hamza, O.; Rana, P.; Odonkor, C. A.; Chow, R.

2026-04-23 pain medicine 10.64898/2026.04.21.26351437 medRxiv
Top 0.1%
14.6%

Introduction: Large language models are increasingly being used in healthcare. In interventional pain medicine, clinical reasoning is essential for procedural planning. Prior studies show that simplified prompts reduce clinical detail in AI-generated responses. It remains unclear whether this reflects knowledge loss or simply prompt-driven suppression of information. Methods: We performed a controlled comparative study using 15 standardized low back pain questions representing common interventional pain questions. Each question was submitted to ChatGPT under three conditions: a professional-level prompt (DP), a fourth-grade reading-level prompt (D4), and clinician-directed rewriting of the D4 response to a medical level (U4→MD). No follow-up prompting was allowed. Three physicians independently rated responses for accuracy using a 0-2 ordinal scale. Clinical completeness was determined by consensus. Word count and Flesch-Kincaid Grade Level (FKGL) were also measured. Paired t-tests compared conditions. Results: Accuracy was highest with professional prompting (1.76). Accuracy declined with the fourth-grade prompt (1.33; p = 0.00086). When simplified responses were rewritten for clinicians, accuracy returned to baseline (1.76; p ≈ 1.00 vs DP). Clinical completeness followed the same pattern: DP 80.0%, D4 6.7%, U4→MD 73.3%. Fourth-grade responses were shorter and less complex. Upscaled responses were more complex and similar in length to professional responses. Inter-rater reliability was low (Fleiss' κ = 0.17), but trends were consistent across conditions. Conclusions: Reduced clinical detail under simplified prompts appears to reflect constrained output rather than loss of knowledge. Clinician-directed reframing restores omitted content. LLM performance in interventional pain depends strongly on prompt design and intended audience.
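The inter-rater statistic quoted above, Fleiss' kappa, compares the observed agreement among a fixed number of raters with the agreement expected by chance. The sketch below implements the textbook formula; it is not the study's analysis code, and the rating table in the usage lines is invented for illustration (three raters, a 0-2 scale).

```python
def fleiss_kappa(counts: list[list[int]]) -> float:
    """Fleiss' kappa for an items x categories table of rater counts.

    counts[i][j] is the number of raters who assigned item i to category j.
    Every item must be rated by the same number of raters (at least two).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of ratings")
    n_categories = len(counts[0])

    # Observed agreement: proportion of agreeing rater pairs, averaged over items.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    observed = sum(per_item) / n_items

    # Chance agreement from the overall share of ratings in each category.
    shares = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    expected = sum(p * p for p in shares)
    if expected == 1.0:
        raise ValueError("kappa is undefined when all ratings share one category")
    return (observed - expected) / (1.0 - expected)


# Invented table: rows are responses, columns are scores 0, 1, 2.
ratings = [[0, 1, 2], [0, 0, 3], [1, 2, 0], [0, 3, 0]]
kappa = fleiss_kappa(ratings)
```

Values near 0 indicate agreement close to chance, which is why a kappa of 0.17 is described as low even when mean scores follow a consistent trend.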

4
Clinician Experiences with Ambient AI Scribe Technology in Singapore: A Qualitative Study

Shankar, R.; Goh, A.; Xu, Q.

2026-03-19 health informatics 10.64898/2026.03.17.26348627 medRxiv
Top 0.1%
12.6%

Background: The administrative burden of clinical documentation is a recognised contributor to clinician burnout and diminished care quality. Ambient artificial intelligence (AI) scribe technology, which uses large language models to passively record and summarise clinical encounters, has rapidly gained traction internationally. However, no published studies have examined clinician experiences with this technology in the Asia-Pacific region or within Singapore's multilingual healthcare system. Objective: This study explored clinician perspectives on ambient AI scribe technology at Alexandra Hospital, Singapore, focusing on perceived benefits, barriers, workflow integration, ethical considerations, and recommendations for sustained implementation. Methods: A qualitative descriptive study was conducted using semi-structured interviews with 28 clinicians across multiple specialties at Alexandra Hospital, National University Health System (NUHS). Participants were purposively sampled for diversity in role, specialty, and usage level. Interviews were analysed using reflexive thematic analysis guided by the RE-AIM/PRISM framework. The COREQ checklist was followed. Results: Five themes emerged: (1) reclaiming presence in the clinical encounter, (2) navigating accuracy and trust in AI-generated documentation, (3) workflow disruption and adaptation, (4) privacy, consent, and ethical tensions within Singapore's regulatory landscape, and (5) envisioning sustainable integration. Clinicians reported improved patient engagement and reduced cognitive burden. Persistent barriers included accuracy concerns, AI hallucinations, limited multilingual functionality, loss of documentation style, and uncertainties around compliance with the Personal Data Protection Act (PDPA). Conclusions: Ambient AI scribe technology holds promise for alleviating documentation burden in Singapore's public healthcare system. Realising this potential requires attention to safety validation, multilingual capability, clinician training, and patient-centred consent aligned with local regulatory frameworks.

5
Prompt-engineering improves clinical safety of large language models for opioid equipotency conversion

Marton, T.; Corpman, D.; Lai, L.; Gabriel, R. A.; Chen, Y.

2026-05-08 pain medicine 10.64898/2026.05.06.26352590 medRxiv
Top 0.1%
12.4%

Background: Large language models (LLMs) are increasingly used in medical education and clinical decision-making, but their reliability in high-risk medication dosing remains unclear. Opioid rotation is a common task requiring precise calculations where errors may result in overdose or inadequate pain relief. Methods: Thirteen LLMs were tested using an API-based framework to ensure independent queries across trials. First, fictional clinical scenarios were tested to simulate real-world clinical situations involving opioid rotation; to test the effects of changes in wording, scenarios were revised into 4 "vignettes" showing the same clinical situation. Next, opioid pairs were tested with a random-dose paradigm across a clinically pertinent range (5-120 mg daily morphine equivalents). LLM outputs were compared with expected values derived from reference standards. Accuracy was assessed using predefined safety thresholds: tight accuracy (0.85-1.15x expected dose) and broad accuracy (0.6-1.7x). We tested models naively and with prompts augmented with reference tables and unit explanations. Results: Naive models generally exhibited low tight-range accuracy across opioid pairs. For any given opioid pair, each model would consistently produce similar incorrect conversion ratios despite wide variability across opioid pairs and language models. Vignette wording changes accounted for 76% of within-scenario response variance. Reference-based prompt augmentation significantly improved performance, with over half of models achieving high proportions of conversions within tight accuracy for morphine-equivalent conversions. Conclusions: While commercial LLMs demonstrated variable accuracy in the native state, prompt augmentation significantly improved their performance.
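The two safety thresholds described above can be read as bands on the ratio of the model's dose to the expected dose. The sketch below is one plausible way to apply them, not the authors' framework: treating the band edges as inclusive, the function names, and the example doses are all assumptions made for illustration.

```python
# Ratio bands taken from the abstract; inclusive edges are an assumption.
TIGHT_BAND = (0.85, 1.15)
BROAD_BAND = (0.6, 1.7)


def classify_conversion(model_dose: float, expected_dose: float) -> str:
    """Label a model-produced dose as 'tight', 'broad', or 'outside'."""
    if expected_dose <= 0:
        raise ValueError("expected_dose must be positive")
    ratio = model_dose / expected_dose
    if TIGHT_BAND[0] <= ratio <= TIGHT_BAND[1]:
        return "tight"
    if BROAD_BAND[0] <= ratio <= BROAD_BAND[1]:
        return "broad"
    return "outside"


def tight_accuracy(pairs: list[tuple[float, float]]) -> float:
    """Share of (model_dose, expected_dose) pairs that fall in the tight band."""
    labels = [classify_conversion(model, expected) for model, expected in pairs]
    return labels.count("tight") / len(labels)


# Hypothetical trials: (model output, reference value), same units in each pair.
trials = [(30.0, 30.0), (45.0, 30.0), (60.0, 30.0), (15.0, 30.0)]
share_tight = tight_accuracy(trials)
```

Note that the broad band is asymmetric (0.6 to 1.7), so in this reading a 40% underestimate still counts as broadly accurate while a 75% overestimate does not.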

6
Ambient AI Documentation in Mixed-Language Encounters: A Heuristic Evaluation of Spanish-English and Mandarin-English Conversations

Hu, D.; Flores, D.; Flores, L.; Chien, R.; Lam, K.; Chow, E.; Guo, Y.; Tam, S.; Perret, D.; Pandita, D.; Zheng, K.

2026-05-22 health informatics 10.64898/2026.05.19.26353603 medRxiv
Top 0.1%
12.3%

Ambient AI documentation systems rely on automatic speech recognition to transcribe patient-provider conversations before generating clinical notes. However, little empirical evidence exists on how these systems perform in mixed-language clinical encounters. We conducted a mixed-method heuristic evaluation of an ambient AI documentation tool using 24 reenacted primary care conversations involving Spanish-English and Mandarin-English code-switching. Quantitative analyses measured mixed error rate (MER) and code-switching detection. Overall MER was low, with a median of 4% and less variation in Spanish-English conversations, and 9% in Mandarin-English conversations, but with outliers reaching 67%. The system generally detected language switches reliably, although deletions occurred frequently in Mandarin-English transcripts at switch points. Qualitative analysis revealed transcription errors related to phonetic similarity, automatic language translation, clinical terminology recognition, and language-specific challenges. These findings highlight considerations for improving ambient AI clinical documentation systems to support multilingual providers in delivering care for linguistically diverse populations.

7
Identifying clinician perceived priorities for a real-time wearable system for in-hospital monitoring: findings and evolutions following the COVID-19 pandemic

Vollam, S.; Roman, C.; King, E.; Tarassenko, L.

2026-04-24 health systems and quality improvement 10.64898/2026.04.21.26350610 medRxiv
Top 0.1%
10.1%

A Wearable Monitoring System (WMS), comprising a chest patch, wrist-worn pulse oximeter, and arm-worn blood pressure device, was developed in preparation for a pilot Randomised Controlled Trial (RCT) on a UK surgical ward. The system was designed to support continuous physiological monitoring and early detection of deterioration. An initial prototype user interface was developed by the research team based on prior clinical experience and engineering knowledge. To ensure suitability for clinical practice, iterative user-centred refinement was undertaken through a series of clinician focus groups and wearability assessments. Six focus groups were conducted between November 2019 and May 2021 involving multidisciplinary healthcare professionals. Feedback from these sessions informed successive interface and system modifications. System development spanned the COVID-19 pandemic, during which the WMS was rapidly adapted and deployed to support clinical care on isolation wards. Feedback obtained during this period was incorporated into later versions of the system and provided a unique opportunity to examine changes in clinician priorities under pandemic conditions. Clinicians consistently prioritised alert visibility, alarm fatigue mitigation, parameter flexibility, and centralised monitoring. Notably, preferences regarding alert modality and access mechanisms evolved over time: early enthusiasm for mobile or smartphone-type devices shifted towards a preference for fixed, ward-based displays and audible alerts at the nurses' station following pandemic deployment. Building on previous wearability testing in healthy volunteers, wearability testing using a validated questionnaire was completed by 169 patient participants during the RCT. The chest patch and pulse oximeter demonstrated high tolerability, whereas the blood pressure cuff showed poor wearability and was removed from the final system. These findings demonstrate the importance of iterative, clinician-led design for wearable monitoring systems and highlight how extreme clinical contexts such as the COVID-19 pandemic can significantly reshape perceived requirements for safety-critical monitoring technologies.

8
Corpus for Benchmarking Clinical Speech De-identification

Dai, H.-J.; Fang, L.-C.; Mir, T. H.; Chen, C.-T.; Feng, H.-H.; Lai, J.-R.; Hsu, H.-C.; Nandy, P.; Panchal, O.; Liao, W.-H.; Tien, Y.-Z.; Chen, P.-Z.; Lin, Y.-R.; Jonnagaddala, J.

2026-04-03 health informatics 10.64898/2026.03.31.26349906 medRxiv
Top 0.1%
10.1%

Objectives: Publicly available datasets dedicated to clinical speech de-identification tasks remain scarce due to privacy constraints and the complexity of speech-level annotation. To address this gap, we compiled the SREDH-AICup sensitive health information (SHI) speech corpus, a time-aligned clinical speech dataset annotated across 38 SHI categories. Methods: Two publicly available English medical-domain datasets were adapted to support speech-level de-identification, including script reformulation and controlled re-recording by 25 participants. Additional Mandarin Chinese clinical-style materials were incorporated to extend linguistic coverage. All audio data were annotated with million-level, time-aligned SHI spans using Label Studio. Inter-annotator agreement was evaluated using Cohen's kappa, following iterative calibration rounds. The resulting corpus supports both automatic speech recognition (ASR) and speech-level recognition of SHIs. Results: The final dataset comprises 20 hours of annotated audio, divided into training (10 hours, 1,539 files), validation (5 hours, 775 files), and test (5 hours, 710 files) subsets, totalling 7,830 SHI entities. The language distribution reflects the composition of the selected source materials, with 19.36 hours of English and 0.89 hours of Mandarin Chinese speech. Discussion: The corpus exhibits a long-tail distribution consistent with clinical documentation patterns and highlights the limited availability of Chinese medical speech resources. These characteristics underscore both the realism of the dataset and structural challenges associated with multilingual speech de-identification. Conclusion: The SREDH-AICup SHI speech corpus provides a clinically grounded, time-aligned speech dataset supporting automated medical speech de-identification research and facilitating future development of multilingual speech-based privacy protection systems.

9
AI Decision Support for Challenging Teledermatology Cases: MedGemma Performance in the Dermatology ECHO Program

Appiagyei, J. B.; Otu, R. O.; Henry, M. K.; Casterline, B. W.; Becevic, M.

2026-05-26 health informatics 10.64898/2026.05.21.26353523 medRxiv
Top 0.1%
8.9%

Teledermatology expands access to dermatologic expertise in rural settings, yet diagnostic uncertainty persists in low-resource primary care. This retrospective study evaluated MedGemma-4B-IT, a compact multimodal vision-language model, as adjunctive clinical decision support for challenging diagnostic cases. We analyzed 77 zero-concordance cases (360 clinical photographs) from a Dermatology Extension for Community Healthcare Outcomes (ECHO) tele-mentoring program (2016-2021). Zero-concordance cases showed no overlap between primary clinician provisional diagnosis and dermatologist-confirmed diagnosis. The model was prompted using dermatologist-style format to generate ranked differential diagnoses. Performance was assessed using strict case-level top-k exact-match accuracy and relaxed matching criteria based on fuzzy string similarity. MedGemma achieved 0.0% strict top-1 accuracy, 1.3% top-3 accuracy, 3.9% top-5 accuracy, and 3.9% top-10 accuracy. Relaxed concept-level matching achieved 28.6% top-1, 63.6% top-5, and 67.5% top-10 accuracy. Image-level accuracy was 44.2% (159/360, 95% CI 39.0-49.5%). The model surfaced the correct diagnosis within differential lists in 45.5% of cases despite no exact top-1 matches, suggesting utility for differential expansion rather than definitive diagnosis. Performance varied across diagnostic categories, with highest accuracy in Other categories (54.5%) and lowest in neoplastic conditions (0.0%). Common errors included confusion between inflammatory and other diagnostic groupings. These findings characterize MedGemma performance on real-world teledermatology cases and inform safe, clinician-in-the-loop integration into teledermatology workflows where specialist oversight remains essential.
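The gap between strict and relaxed accuracy above comes down to how a predicted diagnosis string is compared with the confirmed one. The sketch below illustrates case-level top-k accuracy under both regimes; it is an assumption-laden illustration rather than the study's pipeline: `difflib.SequenceMatcher` stands in for whatever fuzzy similarity the authors used, the 0.8 threshold is arbitrary, and the cases are invented.

```python
from difflib import SequenceMatcher


def _normalize(label: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(label.lower().split())


def is_match(predicted: str, truth: str, threshold: float = 1.0) -> bool:
    """Exact match at threshold 1.0; otherwise fuzzy string similarity."""
    a, b = _normalize(predicted), _normalize(truth)
    if threshold >= 1.0:
        return a == b
    return SequenceMatcher(None, a, b).ratio() >= threshold


def top_k_accuracy(cases, k: int, threshold: float = 1.0) -> float:
    """Share of cases whose truth appears among the first k ranked predictions.

    cases is a list of (ranked_predictions, confirmed_diagnosis) pairs.
    """
    hits = sum(
        any(is_match(pred, truth, threshold) for pred in ranked[:k])
        for ranked, truth in cases
    )
    return hits / len(cases)


# Invented cases: (model's ranked differential, dermatologist-confirmed diagnosis).
cases = [
    (["psoriasis", "atopic dermatitis"], "atopic dermatitis"),
    (["tinea corporis"], "lichen planus"),
    (["seborrheic dermatitis"], "seborrhoeic dermatitis"),  # spelling variant
]
strict_top1 = top_k_accuracy(cases, k=1)
relaxed_top1 = top_k_accuracy(cases, k=1, threshold=0.8)
```

In this toy set the spelling variant fails strict matching but passes the relaxed criterion, which is the kind of effect that can separate a 0.0% strict top-1 score from a much higher relaxed one.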

10
Design of a Secure Wearable Health Data Sharing Platform for Region Hovedstaden: A FHIR DK and GDPR-Compliant Service Architecture

Chowdhury, A.; Irtiza, A.

2026-03-13 health systems and quality improvement 10.64898/2026.03.12.26348210 medRxiv
Top 0.1%
8.5%

The 1.8 million residents of Region Hovedstaden (Denmark's Capital Region) currently lack a secure, standardized pathway for integrating continuous wearable health data into Sundhed.dk, the national electronic health record. Consumer wearables such as Apple Watch, Oura Ring, and Garmin generate longitudinal physiological data relevant to chronic disease management, yet existing workflows rely on manual, non-standardized exports incompatible with FHIR DK v6.0.2 profiles and GDPR Article 25 privacy-by-design requirements. This paper presents a conceptual five-layer microservice architecture for secure wearable data sharing, employing MitID national authentication, National Service Infrastructure (NSI) integration, and Zero Trust security controls. Requirements were derived from a mixed-methods study including surveys of 47 Danish stakeholders and systematic benchmarking of existing platforms. Results show 51.1% conditional willingness to share wearable data under secure conditions, with audit transparency and non-medical misuse identified as central trust factors. Fourteen MoSCoW-prioritized requirements (F1-F7, NF1-NF7) are mapped to architecture components, providing a traceable blueprint for closing the interoperability gap in Danish public healthcare.

11
Comparative Evaluation of Wearable Sensor Form Factors for Physiological Monitoring in Youth with Autism Spectrum Disorder

Stewart, C.; Albertazzi, A.; Tasarz, J.; Kim, K.; Gandara, V.; Blucher, C.; Reyes-Martinez, C. C.; Smarr, B.; Besterman, A. D.

2026-05-07 health informatics 10.64898/2026.05.06.26352564 medRxiv
Top 0.1%
8.4%

Sudden behavioral outbursts in youth with autism spectrum disorder (ASD) are difficult to predict and create substantial caregiving burdens. Wearable physiological monitoring might enable prediction, but sustained use may be limited by tolerability. We evaluated adherence and data completeness in 40 youth with ASD over a two-week period across four device types (wristband, headband, adhesive chest patch, and finger ring) alongside caregiver-reported usability and comfort. Data completeness varied markedly by device, with the patch achieving the highest completeness (~80%), followed by the wristband (~60%), headband (~50%), and ring (~20%). In multivariate analyses, adherence was driven by the device form factor rather than participant-level clinical characteristics. Devices rated as more comfortable did not yield higher completeness, revealing a divergence between reported preference and actual use. These findings suggest that device choice is a key consideration for studies in youth with ASD, highlighting the need for research into model stability across sensor types in neurodivergent populations.

12
No One Left Behind: Adaptive Tablet Modalities for Digitally Excluded Emergency Department Patients - Design, Implementation, and Social Evidence for an Impairment-First Interface

Chowdhury, A.; Irtiza, A.

2026-04-13 health systems and quality improvement 10.64898/2026.04.11.26350686 medRxiv
Top 0.1%
8.4%

Background: Urgent care departments in Europe face a structural paradox: accelerating digitalisation is accompanied by a patient population that is disproportionately unable to engage with standard digital tools. An internal analysis at the Emergency Department (Akutafdelingen) of Nordsjaellands Hospital in Hillerod, Denmark found that 43% of emergency patients struggle with digital solutions -- a figure that reflects the predictable composition of acute care populations rather than any individual failing. Objective: This paper presents the design, iterative development, and secondary validation of the ED Adaptive Interface (v5): a prototype adaptive patient terminal developed in response to this challenge. The system operationalises what the author terms impairment-first design -- a methodology that treats the most constrained patient experience as the primary design problem and derives the standard experience as a subset. The interface configures itself in under ten seconds via nurse-led setup, adapting across four axes of impairment: visual, motor, speech, and cognitive. System: Version 4 supports five accessibility modes, a heatmap pain assessment grid, a Privacy and Dignity panel, a live workflow tracker with care notifications, structured dual-category help requests, and plain-language medical term definitions across four languages. Version 5, reported here for the first time, introduces a Condition Worsening Escalation button, a Referral Pathway Display, a "Why Am I Waiting?" triage explainer, a Symptom Progression Log, MinSP/Yellow Card Scan simulation, expanded language support (seven languages: English, Danish, Arabic with full RTL layout, Turkish, Romanian, Polish, and Somali), and an expanded ten-item Communication Board. The entire system runs as a single 79-kilobyte HTML file with zero infrastructure requirements. Methods: To ground the design in patient-generated evidence, two independent social media threads were subjected to inductive thematic analysis (Braun and Clarke, 2006): a primary corpus of 83 entries from the Facebook group Foreigners in Denmark (collected March 2026) and a corroborating corpus from an international community group in the Aarhus region (collected April 2026). All identifiers in both datasets were fully anonymised under GDPR Article 89 research provisions prior to analysis. No participants were contacted. Generative AI tools were used to assist with drafting, writing, and prototype code development in the preparation of this manuscript; all scientific content, data collection, analysis, and conclusions are the sole responsibility of the authors. Results: The primary corpus produced five major themes corresponding to the five problem areas the prototype was intended to cover: system navigation and triage literacy gaps (31 entries); language and cultural barriers (6 entries); communication failures during care (5 entries); staff overload and capacity constraints (8 entries); and pain and severity assessment failures (14 entries). The corroborating corpus independently supported all five themes and surfaced two new ones: the differential treatment of international patients, and medical gaslighting as a long-term pattern of patient advocacy failure. One major structural finding -- the five most-liked comments criticised the original poster for self-referring to the ED when she had in fact received an explicit 1813 telephone referral to the ED -- directly inspired the Referral Pathway Display and "Why Am I Waiting?" features in v5. Conclusions: The convergence of design rationale and independent social evidence across all five problem categories suggests that impairment-first design is not a niche accessibility concern but a structural approach to healthcare interface quality. The prototype is ready for a structured clinical pilot using the System Usability Scale (SUS) and semi-structured staff interviews. The long-term roadmap includes full MinSP integration, hospital PMS connectivity, and clinical validation.

13
Clinician Discourse on Ambient AI Scribes: A Reddit-based Topic Modelling and Sentiment Analysis

Shankar, R.; Xu, Q.

2026-04-30 health informatics 10.64898/2026.04.26.26351798 medRxiv
Top 0.1%
7.4%

Background: Ambient AI scribes are rapidly entering clinical workflows, yet end-user perspectives remain underrepresented in the peer-reviewed literature. Online clinician communities offer an unfiltered window into adoption barriers, perceived benefits, and product-level concerns. Objective: To characterise themes and sentiment in clinician discourse on ambient AI scribes across professional Reddit communities. Methods: We scraped posts from ten clinically oriented subreddits using twelve AI-scribe-related queries via the public Reddit JSON API. A two-tier keyword filter retained posts mentioning at least one AI scribe term and one clinical or workflow term. Texts were embedded with all-MiniLM-L6-v2, reduced via UMAP, clustered with HDBSCAN, and labelled using BERTopic with c-TF-IDF keyword extraction. Noise topics matching predefined off-topic patterns (for example, residency match, finance) were removed. Themes were assigned concise labels via Claude Sonnet 4. Sentiment was classified per post using cardiffnlp/twitter-roberta-base-sentiment-latest. Results: After filtering, 176 unique relevant posts from seven active subreddits were retained, with r/FamilyMedicine (n = 64) and r/healthIT (n = 34) dominating. BERTopic produced 12 coherent themes spanning workflow integration, vendor comparison (DAX, Heidi, Freed, Abridge), HIPAA and privacy, mobile and device use, templates and formatting, and research versus clinical use. Overall sentiment was 61.4% neutral, 21.6% positive, and 17.0% negative. The most net-positive theme was DAX/Nuance/AI tools (about 55% positive); the most net-negative were charting fatigue and the freed-AI-scribes discussion thread (about 37 to 40% negative). Engagement (median upvotes and comments) was highest for tool-comparison and pricing themes, indicating salience of practical adoption questions. Conclusions: Clinician sentiment toward ambient AI scribes is cautiously favourable but dominated by neutral, problem-solving discourse. Vendor selection, cost, HIPAA compliance, and EHR integration are the most actively debated issues. These insights can inform implementation strategy, vendor benchmarking, and policy guidance for ambient documentation tools.
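The two-tier keyword filter in the Methods keeps a post only when it mentions at least one AI scribe term and at least one clinical or workflow term. A minimal sketch of that logic follows; the term lists are hypothetical placeholders (the abstract does not give the actual vocabulary), and whole-word, case-insensitive matching is an assumption.

```python
import re

# Hypothetical term lists; the study's actual vocabulary is not given in the abstract.
AI_SCRIBE_TERMS = ("ai scribe", "ambient scribe", "dax", "abridge")
CLINICAL_TERMS = ("charting", "ehr", "clinic", "documentation", "patient")


def _mentions_any(text: str, terms: tuple[str, ...]) -> bool:
    """True if any term occurs as a whole word or phrase, ignoring case."""
    return any(
        re.search(r"\b" + re.escape(term) + r"\b", text, flags=re.IGNORECASE)
        for term in terms
    )


def passes_two_tier_filter(text: str) -> bool:
    """Tier 1: an AI scribe term. Tier 2: a clinical or workflow term. Both required."""
    return _mentions_any(text, AI_SCRIBE_TERMS) and _mentions_any(text, CLINICAL_TERMS)


# Invented posts for illustration.
posts = [
    "Has anyone tried an AI scribe for charting in family medicine?",
    "The DAX index fell sharply today",
    "Charting takes me three hours a night",
]
kept = [post for post in posts if passes_two_tier_filter(post)]
```

Requiring both tiers is what screens out posts where a vendor name is used in an unrelated sense, as in the second invented post.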

14
Beyond AI Psychosis and Sycophancy: Structural Drift as a System-Level Safety Failure

Kim, J. E.; Holbrook, E. B.; Hron, J. D.; Parsons, C. R.

2026-03-19 health informatics 10.64898/2026.03.19.26346371 medRxiv
Top 0.1%
7.1%

Background: Conversational AI safety systems are primarily evaluated using message-level content monitoring, which assesses inputs and outputs in isolation. This message-by-message approach can miss interaction-level risks that emerge over extended conversations, including patterns discussed in reports of "AI psychosis." Critically, by the time users express overt psychosis-spectrum content, opportunities for intervention may be limited. Objective: We investigated whether LLM responses gradually expand and connect interpretations beyond the user's original concerns, a process we term structural drift. We also tested whether this drift can be detected early and automatically. Methods: We developed an automated, LLM-adapted rubric-based prompt for seven domains of anomalous (psychosis-spectrum) experience, derived from phenomenological psychiatry to capture subtle shifts in subjective interpretation. In Part 1, we evaluated the rubric using gold-standard text excerpts (N = 484) adapted from clinically validated qualitative instruments. In Part 2, we analyzed 1,290 user-LLM response exchanges from 7 dialogues, using 3 different LLMs (5 repeats each), to measure (i) domain amplification (increasing score within a domain) and (ii) domain expansion (new domains appearing over time). Results: Automated scoring showed strong agreement with gold-standard excerpts (domain accuracy 82.7-98.9%; exact 0-3 agreement 63.6-82.7%). Across dialogues, we observed significant amplification in four domains (p < .05; d = 0.14-0.46) and domain expansion in 83.8% of dialogues (88/105; p < .001). Conclusions: AI responses can systematically expand and intensify users' descriptions beyond their initial input. Taken together with the predictive-processing accounts of psychosis, the exposure itself may reinforce maladaptive inferences. Because drift is detectable from ordinary dialogue without clinical-style probing, this structural drift detection may support scalable, real-time monitoring for emerging risks before overt escalation.
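The two drift measures named in the Methods, domain amplification and domain expansion, can be illustrated on a sequence of per-turn rubric scores. The sketch below is a simplified reading and not the authors' operationalization: the first-versus-last comparison, the "score above zero" presence rule, and the domain names are all assumptions.

```python
def domain_expansion(turn_scores: list[dict[str, int]]) -> set[str]:
    """Domains scored above zero in a later turn but absent from the first turn."""
    initial = {domain for domain, score in turn_scores[0].items() if score > 0}
    later: set[str] = set()
    for scores in turn_scores[1:]:
        later |= {domain for domain, score in scores.items() if score > 0}
    return later - initial


def domain_amplification(turn_scores: list[dict[str, int]], domain: str) -> int:
    """Change in a domain's 0-3 score from the first turn to the last turn."""
    return turn_scores[-1].get(domain, 0) - turn_scores[0].get(domain, 0)


# Invented dialogue: each dict maps a (hypothetical) domain name to a 0-3 score.
dialogue = [
    {"perception": 1},
    {"perception": 2, "self": 1},
    {"perception": 3, "self": 1, "thought": 1},
]
new_domains = domain_expansion(dialogue)             # domains that appeared later
growth = domain_amplification(dialogue, "perception")  # first-to-last score change

# The dialogue count in the abstract follows from the design: 7 x 3 x 5 runs.
n_runs = 7 * 3 * 5
```

A statistical test over many runs, as reported in the abstract, would sit on top of per-dialogue summaries like these.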

15
Combining Token Classification With Large Language Model Revision for Age-Friendly 4M Entity Recognition From Nursing Home Text Messages: Development and Evaluation Study

Amewudah, P.; Popescu, M.; Farmer, M. S.; Powell, K. R.

2026-04-01 health informatics 10.64898/2026.03.31.26349861 medRxiv
Top 0.1%
7.1%

Background: Secure text messages (TMs) exchanged among interdisciplinary care teams in nursing homes (NHs) contain clinical information that aligns with the Age-Friendly Health Systems 4Ms: What Matters, Medication, Mentation, and Mobility. Yet this information is not captured in any structured form, making it unavailable for systematic monitoring or quality reporting. Automatically extracting 4M information accurately and efficiently from these messages could enable several downstream applications within long-term care settings. This task, however, is challenging because of the fragmented syntax, brevity, abbreviations, and informality of TMs. Objective: This study aimed to develop and evaluate a multi-stage 4M Entity Recognition (4M-ER) pipeline that combines a fine-tuned token classifier with large language model (LLM) revision, using only locally deployed open-source models, to improve 4M information extraction from clinical TMs. Methods: We used an expert-annotated dataset of 1,169 TMs collected from interdisciplinary teams across 16 Midwest NHs. The pipeline first identifies candidate text spans using a fine-tuned Bio-ClinicalBERT token classifier. A semantic similarity retriever then selects in-context exemplars to guide an LLM revision in which the LLM (Gemma, Phi, Qwen, or Mistral) performs boundary correction, label evaluation, and selective acceptance or rejection of candidate spans. Baselines for comparison included single-stage zero-shot LLMs, single-stage fine-tuned Bio-ClinicalBERT, and a fine-tuned LLM (Gemma) from a prior study. Ablation studies assessed the contribution of each pipeline stage and the effect of message filtering. Robustness was evaluated across 5 repeated runs. Results: The 4M-ER pipeline outperformed the previously fine-tuned Gemma LLM across all 4M domains, achieving F1 (entity type) improvements of +2 to +11 percentage points without any additional fine-tuning and at roughly half the GPU memory (12 vs 24 GB). 
It also improved upon single-stage fine-tuned Bio-ClinicalBERT in Mobility, Mentation, and What Matters (+0.02 to +0.05 F1). Error analysis showed that LLM revision reduced false positives by 25% to 35% by correcting misclassifications caused by conversational ambiguity, while the fine-tuned Bio-ClinicalBERT's high recall captured subtle entities that the fine-tuned Gemma missed. Silver data augmentation further improved the hardest domains, raising What Matters F1 from 0.59 to 0.67 and Mobility from 0.64 to 0.67. Ablation studies confirmed that restricting LLMs to revision only yielded optimal accuracy and efficiency. Conclusions: The 4M-ER pipeline enables accurate and scalable extraction of 4M entities from clinical TMs by combining fine-tuned Bio-ClinicalBERT with LLM revision using only locally deployed open-source models. The structured 4M data produced by the pipeline can support 4M taxonomy and ontology construction, as demonstrated in the prior work, and provides a foundation for downstream applications including real-time clinical surveillance, compliance with emerging age-friendly quality measures, and predictive modeling in long-term care settings.
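
The three-stage control flow described in this abstract (candidate spans from a token classifier, retrieval of similar annotated exemplars, then a revision step that accepts or rejects each span) can be sketched roughly as follows. This is a toy illustration, not the authors' code: `Span`, `jaccard`, `retrieve_exemplars`, `stub_reviser`, and `run_pipeline` are hypothetical names, Jaccard token overlap stands in for the semantic similarity retriever, and the reviser is a rule-based placeholder where an LLM call would go.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Span:
    text: str
    label: str  # one of the 4M domains, e.g. "Medication"


Exemplar = Tuple[str, Span]  # (annotated message, gold span)
Reviser = Callable[[str, Span, List[Exemplar]], Optional[Span]]


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity (assumption: stand-in for a semantic retriever)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def retrieve_exemplars(message: str, pool: List[Exemplar], k: int) -> List[Exemplar]:
    """Return the k annotated messages most similar to the input message."""
    return sorted(pool, key=lambda ex: jaccard(message, ex[0]), reverse=True)[:k]


def stub_reviser(message: str, span: Span, exemplars: List[Exemplar]) -> Optional[Span]:
    """Placeholder for the LLM revision step (hypothetical rule):
    keep a candidate only if a retrieved exemplar carries the same label."""
    labels = {gold.label for _, gold in exemplars}
    return span if span.label in labels else None


def run_pipeline(message: str, candidates: List[Span], pool: List[Exemplar],
                 reviser: Reviser, k: int = 2) -> List[Span]:
    """Stage 1 output (candidates) -> stage 2 retrieval -> stage 3 revision."""
    exemplars = retrieve_exemplars(message, pool, k)
    revised = []
    for span in candidates:
        result = reviser(message, span, exemplars)
        if result is not None:  # None means the reviser rejected the span
            revised.append(result)
    return revised


pool = [
    ("pt refused metoprolol this am", Span("metoprolol", "Medication")),
    ("walked to dining room with walker", Span("walked", "Mobility")),
]
candidates = [Span("lisinopril", "Medication"), Span("this am", "Mobility")]
kept = run_pipeline("pt refused lisinopril this am", candidates, pool, stub_reviser, k=1)
print(kept)
```

In this toy run the spurious Mobility candidate is rejected because the single retrieved exemplar is a Medication message. A real reviser would also adjust span boundaries and labels, which the stub does not attempt.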

16
Leveraging State-of-the-Art LLMs for the De-identification of Sensitive Health Information in Clinical Speech

Dai, H.-J.; Mir, T. H.; Fang, L.-C.; Chen, C.-T.; Feng, H.-H.; Lai, J.-R.; Hsu, H.-C.; Nandy, P.; Panchal, O.; Liao, W.-H.; Tien, Y.-Z.; Chen, P.-Z.; Lin, Y.-R.; Jonnagaddala, J.

2026-04-17 health informatics 10.64898/2026.04.13.26349911 medRxiv
Top 0.1%
7.0%

Accurate recognition and deidentification of sensitive health information (SHI) in spoken dialogues requires multimodal algorithms that can understand medical language and contextual nuance. However, the recognition and deidentification process itself risks exposing SHI. Additionally, the variability and complexity of medical terminology, along with the inherent biases in medical datasets, further complicate this task. This study introduces the SREDH/AI-Cup 2025 Medical Speech Sensitive Information Recognition Challenge, which focuses on two tasks: Task 1, speech transcription, in which systems must accurately transcribe speech into text; and Task 2, medical speech de-identification, in which systems must detect and appropriately classify mentions of SHI. The competition attracted 246 teams; top-performing systems achieved a mixed error rate (MER) of 0.1147 and a macro F1-score of 0.7103, with average MER and macro F1-score of 0.3539 and 0.2696, respectively. Results were presented at the IW-DMRN workshop in 2025. Notably, LLMs were prevalent across both tasks: 97.5% of teams adopted LLMs for Task 1 and 100% for Task 2, highlighting their growing role in healthcare. Furthermore, we fine-tuned six models, demonstrating strong precision (~0.885-0.889) with slightly lower recall (~0.830-0.847), resulting in F1-scores of 0.857-0.867.
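
As a quick consistency check on the precision, recall, and F1 ranges reported for the six models, F1 can be recomputed as the harmonic mean of precision and recall. The sketch below uses the standard definition. Pairing the low ends and the high ends of the two ranges is an assumption, since the abstract does not say which precision belongs with which recall.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Assumed pairing: lowest precision with lowest recall, highest with highest.
f1_low = f1_score(0.885, 0.830)
f1_high = f1_score(0.889, 0.847)
print(f"{f1_low:.3f} {f1_high:.3f}")  # prints "0.857 0.867"
```

Under that pairing the recomputed values match the reported F1 range of 0.857-0.867.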

17
The Golden Opportunity or the Cutting Room Floor? Quantifying and Characterizing the Loss and Addition of Social Determinants of Health during Clinician Editing of Ambient AI Documentation

Kim, S.; Guo, Y.; Sutari, S.; Chow, E.; Tam, S.; Perret, D.; Pandita, D.; Zheng, K.

2026-04-22 health systems and quality improvement 10.64898/2026.04.20.26351322 medRxiv
Top 0.1%
7.0%

Social determinants of health (SDoH) are important for clinical care, but it remains unclear how much AI-captured social context is preserved after clinician editing in ambient documentation workflows. We retrospectively analyzed 75,133 paired ambient AI-drafted and clinician-finalized note sections from ambulatory care at a large academic health system. Using a rule-based NLP pipeline, we extracted 21 SDoH categories and quantified retention, deletion, and addition. SDoH appeared in 25.2% of AI drafts versus 17.2% of final notes. At the mention level, AI captured 29,991 SDoH mentions, of which 45.1% were deleted and 54.9% were retained; clinicians added 3,583 new mentions. Insurance and marital status were most often deleted, whereas substance use and physical activity were more often retained. Deletion patterns also varied by specialty, supporting the need for specialty-aware ambient AI systems.
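
The mention-level figures can be turned into approximate counts with a few lines of arithmetic. The sketch below derives counts from the rounded percentages in the abstract, so the results are estimates and not figures reported by the authors; the variable names are illustrative.

```python
ai_mentions = 29_991      # SDoH mentions captured in AI drafts (from the abstract)
deleted_share = 0.451     # share deleted during clinician editing
retained_share = 0.549    # share retained in the final note
clinician_added = 3_583   # new mentions added by clinicians

# Counts are back-calculated from rounded percentages, so treat them as estimates.
deleted = round(ai_mentions * deleted_share)
retained = round(ai_mentions * retained_share)
final_mentions = retained + clinician_added
added_share_of_final = clinician_added / final_mentions

print(deleted, retained, final_mentions)   # prints "13526 16465 20048"
print(f"{added_share_of_final:.1%}")       # prints "17.9%"
```

On these estimates, roughly one in six SDoH mentions in the finalized notes was written by a clinician instead of carried over from the AI draft.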

18
When Data Meets Practice: A Qualitative Study of Clinician Perspectives on Streaming Data in Mental Health

Tian, J.; Kurkova, V.; Wu, Y.; Adu, M.; Hayward, J.; Greenshaw, A. J.; Cao, B.

2026-04-25 psychiatry and clinical psychology 10.64898/2026.04.23.26351640 medRxiv
Top 0.1%
6.9%

Patient-generated streaming data from wearable and digital technologies are increasingly promoted as a means of supporting mental health monitoring and clinical decision-making. While patient acceptance of these technologies has been reported, clinician perspectives remain underexplored despite their central role in determining whether streaming data are meaningfully integrated into routine care. This study explored clinicians' experiences, as well as perceived facilitators and barriers, related to integrating patient-generated streaming data into routine mental health practice. A qualitative, exploratory interview study was conducted to examine clinicians' experiences and perspectives on integrating patient-generated streaming data into mental health care. Semi-structured interviews were conducted with 33 clinicians, including family physicians (n=11), psychiatrists (n=12), and psychologists (n=10). Data were analyzed using reflexive thematic analysis guided by Braun and Clarke's six-step approach. Six themes were identified. Clinicians described variable use of digital and streaming technologies, ranging from routine engagement to deliberate non-use. Streaming data were viewed as clinically valuable when they provided longitudinal and objective insights, identified physiological and behavioural pattern changes, and supported patient engagement. However, clinicians emphasized that clinical usefulness was contingent on interpretability, contextual information, and relevance to decision-making. Major barriers included poor integration with electronic medical records, time constraints, data volume, limited organizational support, and uncertainty regarding data reliability and validity. Clinicians also expressed persistent concerns about privacy, governance, and regulatory oversight, highlighting the need for clear safeguards and accountability structures. 
Clinicians view patient-generated streaming data as a promising adjunct to mental health care, particularly for capturing longitudinal change between visits. However, meaningful clinical integration remains constrained by usability, workflow, organizational, and regulatory challenges, as well as limited confidence in data interpretation. Addressing these barriers through improved system integration, interpretive support, validation, and governance will be essential for translating the potential of streaming data into routine clinical practice. Author Summary: Mental health symptoms can change between appointments, yet care often depends on periodic visits and patient recall. Devices such as smartwatches and other digital tools can continuously collect information, from mood and sleep to activity and related measures, offering a possible way to support care outside the clinic. While patients are often seen as the main users of these tools, clinicians play a central role in deciding whether such technology is implemented in care. This study interviewed 33 mental health clinicians, including family physicians, psychiatrists, and psychologists, about their views on using patient-generated streaming data in routine care. Clinicians saw promise in these data as they help track changes over time, support discussions with patients, and provide additional insight between visits. However, they also described important barriers, including managing large amounts of data, limited integration with health record systems, uncertainty about data quality, and concerns about privacy and regulation. These findings suggest that successful implementation of streaming data in mental health care will depend on designing systems that are clinically relevant, easy to interpret, and supported by appropriate safeguards and infrastructure.

19
Prototyping a Generative AI-powered Person-centered Digital Health Tool to Mitigate Risk of Preventable Adverse Drug Events

Dobbins, D.; Russell, A.; Gunther, M.; Shetty, V.; Shomali, A.; Vawdrey, D.; Waring, S.; Whary, P.; Wong, J.; Wright, E. A.; Olson, A. W.

2026-06-04 health systems and quality improvement 10.64898/2026.06.02.26354712 medRxiv
Top 0.1%
6.8%

Objectives: Older adults with comorbidities and polypharmacy have a disproportionately high risk of hospitalization as well as readmission from adverse drug events (ADEs), of which 28%-71% are preventable (pADEs). This paper introduces an LLM application, CommunicADE, designed to support risk-mitigation of pADE-related readmission for the aforementioned population. We aim to evaluate CommunicADE's technical performance with OpenAI's HealthBench criteria: accuracy, completeness, communication quality, context awareness, and instruction following. Materials and Methods: Our technical validation study used an LLM (KimiK2.5) to simulate interviews between CommunicADE and nine high-fidelity synthetic patients hospitalized and at increased risk for pADE-related readmission (65+ years, comorbidities, 5+ medications). Some pADE risk mechanism clues were visible to CommunicADE in patient H&Ps, but most mechanisms were solely discoverable in interviews. Two pharmacists evaluated CommunicADE's interview questions and EHR notes with HealthBench-informed variables. Analyses used descriptive statistics. Results: For 35 mechanisms across 9 patients (avg=3.89 mechanisms/patient), CommunicADE's precision and recall were 0.92 and 0.63, respectively. Hallucinations were absent. Coherence and person-centeredness scored 4.28 and 4.44 on a 5-point scale (5=highest). On average, communication was at a 5th grade level and objective for 78% of patients. Most patient-reported quotes included in notes (92%) supported detected mechanisms. CommunicADE followed all instructions regarding interview length and patient approvals. Discussion: CommunicADE's strongest performance was in accuracy (precision, hallucinations), communication quality (coherence, readability), and context awareness (person-centeredness). Completeness (recall) and instruction following (objectivity, pADE mechanism/quote alignment) show room for improvement. 
Conclusion: Findings suggest technical readiness for a feasibility pilot with real-world patients and identify key areas for performance improvement.

20
SmartAlert: Integrating Machine Learning and Alert Triggers into Live Electronic Medical Record Systems, Targeting Low-Yield Inpatient Lab Tests

Jiang, Y.; Ma, S.; Liang, A.; Kim, G.; Acharya, A.; Mony, S.; Punnathanam, S.; Makeown, J.; Jose, J.; Shieh, L.; Pham, T.; Ng, A. Y.; Chen, J. H.

2026-05-06 health informatics 10.64898/2026.04.29.26351965 medRxiv
Top 0.1%
6.7%

This study explores integrating machine learning into electronic medical record systems to predict stability of inpatient lab tests. A smart alerts system was developed and tested at Stanford Hospital. The system identifies stable lab results, advising clinicians on test ordering. Live deployment showed desired precision at good recall in predicting test result stability, with suggestions for system optimization identified. This approach may significantly decrease low-yield testing and enhance personalized clinical decision-making.